

Poster: njwhite Date: Jan 30, 2014 1:36am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Oh really? I didn't know they had pre-built modules for that (the http://finereader.abbyy.com/corporate/tech_specs/ page doesn't mention it). Note that Ancient Greek is quite different to modern Greek (more diacritics, different vocabulary).

Is there some way you can automatically select the language to OCR? As I mentioned, at present all the Ancient Greek books I've looked at in the archive appear to have been treated as Latin, with unusable results.


Poster: Jeff Kaplan (Administrator, Curator, or Staff) Date: Jan 30, 2014 1:32pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Yes, if you use our HTML5 uploader with Chrome, Firefox or Safari at archive.org/upload, you will see a language cell. Click in it and the dropdown menu includes "Greek, Ancient".


Poster: aibek Date: Jan 30, 2014 7:19pm
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

To see what Jeff meant by ‘Abbyy module’, see the attached extract from a log file.

I suggest that you upload a PDF file containing Ancient Greek exclusively, in the manner suggested by Jeff, and check what happens.

Attachment: Module-AbbyyXML.txt


Poster: tfmorris Date: Feb 11, 2014 8:56am
Forum: texts Subject: Re: Using Tesseract to improving OCR for some languages

Does anyone look at improving OCR quality on an ongoing basis, whether through the use of Tesseract or other means? Is OCR ever re-done after the initial pass?

I did a study recently scoring the OCR quality of public domain eBooks in IA and found the quality to be all over the map. I suspect that Tesseract could do a better job in many cases, but I suspect that ABBYY could be improved as well.

I saw some anecdotal evidence that high processing loads on the OCR cluster and the use of "fast mode" were correlated with low OCR quality. That seems fine as an interim measure if the books were later requeued for full processing, but that doesn't seem to happen.

Like Nick, I'd be willing to help with improving the OCR quality.

Tom